Assignment 2

Author

Hanin Almodaweb

Loading and Merging the Datasets

The individual data includes personal and health characteristics of children in 12 communities across Southern California. The regional data include air quality measurements at the community level.

# reading the datasets
chs_individual <- read.csv("/Users/neens/Downloads/chs_individual.csv")
chs_regional <- read.csv("/Users/neens/Downloads/chs_regional.csv")

# merging datasets based on the location variable
merged_data <- merge(chs_individual, chs_regional, by = "townname")

# view the first few rows of the merged dataset
head(merged_data, 5)
  townname sid male race hispanic    agepft height weight      bmi asthma
1   Alpine 841    1    W        1 10.548939    150     78 15.75758      0
2   Alpine 835    0    W        0 10.099932    143     69 15.33749      0
3   Alpine 838    0    O        1  9.486653    133     62 15.93183      0
4   Alpine 840    0    W        0  9.965777    146     78 16.63283      0
5   Alpine 865    0    W        0 10.039699    162    140 24.24797      1
  active_asthma father_asthma mother_asthma wheeze hayfever allergy educ_parent
1             0             0             0      0        0       0           5
2             0             0             0      0        0       1           3
3             0             0             0      0        0       0           4
4             0             0             0      0        0       0          NA
5             1             0             0      1        0       1           3
  smoke pets gasstove      fev      fvc     mmef pm25_mass pm25_so4 pm25_no3
1     0    1        0 2251.505 2594.649 2445.151      8.74     1.73     1.59
2     0    1        0 2529.276 2826.316 3406.579      8.74     1.73     1.59
3    NA    1        0 1737.793 1963.545 2133.110      8.74     1.73     1.59
4    NA    0       NA 2466.791 2638.221 3466.464      8.74     1.73     1.59
5     0    1        1 2583.934 3567.541 2071.475      8.74     1.73     1.59
  pm25_nh4 pm25_oc pm25_ec pm25_om pm10_oc pm10_ec pm10_tc formic acetic  hcl
1     0.88    2.54    0.48    3.04    3.25    0.49    3.75   1.03   2.49 0.41
2     0.88    2.54    0.48    3.04    3.25    0.49    3.75   1.03   2.49 0.41
3     0.88    2.54    0.48    3.04    3.25    0.49    3.75   1.03   2.49 0.41
4     0.88    2.54    0.48    3.04    3.25    0.49    3.75   1.03   2.49 0.41
5     0.88    2.54    0.48    3.04    3.25    0.49    3.75   1.03   2.49 0.41
  hno3 o3_max o3106 o3_24   no2  pm10 no_24hr pm2_5_fr iacid oacid total_acids
1 1.98  65.82 55.05 41.23 12.18 24.73    2.48    10.28  2.39  3.52         5.5
2 1.98  65.82 55.05 41.23 12.18 24.73    2.48    10.28  2.39  3.52         5.5
3 1.98  65.82 55.05 41.23 12.18 24.73    2.48    10.28  2.39  3.52         5.5
4 1.98  65.82 55.05 41.23 12.18 24.73    2.48    10.28  2.39  3.52         5.5
5 1.98  65.82 55.05 41.23 12.18 24.73    2.48    10.28  2.39  3.52         5.5
        lon      lat
1 -116.7664 32.83505
2 -116.7664 32.83505
3 -116.7664 32.83505
4 -116.7664 32.83505
5 -116.7664 32.83505

Data Wrangling

  1. After merging the data, make sure you don’t have any duplicates by counting the number of rows. Make sure it matches.
# counting the number of rows in the individual and regional datasets
nrow_individual <- nrow(chs_individual)
nrow_regional <- nrow(chs_regional)

cat("Number of rows in the individual dataset: ", nrow_individual, "\n")
Number of rows in the individual dataset:  1200 
cat("Number of rows in the regional dataset: ", nrow_regional, "\n")
Number of rows in the regional dataset:  12 
# counting the number of rows in the merged dataset
nrow_merged <- nrow(merged_data)
cat("Number of rows in the merged dataset: ", nrow_merged, "\n")
Number of rows in the merged dataset:  1200 
# checking for duplicates in the merged dataset
duplicates <- sum(duplicated(merged_data))
cat("Number of duplicate rows in the merged dataset: ", duplicates, "\n")
Number of duplicate rows in the merged dataset:  0 

The individual and merged datasets both contain 1,200 rows, indicating that no duplicate observations are present in the merged dataset.

In the case of missing values, impute data using the average amongst individuals with the same values for the “male” and “hispanic” variables. For categorical variables, take the mode.

# function to calculate mode
get_mode <- function(v) {
  uniqv <- unique(v[!is.na(v)])  # Exclude NA values
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# for numerical variables: Impute using group mean (based on male and hispanic)
numerical_vars <- c("agepft", "height", "weight", "bmi", "fev", "fvc", "mmef", "pm25_mass", "pm25_so4", "pm25_no3", "pm25_nh4", "pm25_oc", "pm25_ec", "pm25_om", "pm10_oc", "pm10_ec", "pm10_tc", 
"formic", "acetic", "hcl", "hno3", "o3_max", "o3106", "o3_24", "no2", "pm10", "no_24hr", "pm2_5_fr", "iacid", "oacid", "total_acids", "lon", "lat")

# impute missing numerical values
merged_data <- merged_data %>%
  group_by(male, hispanic) %>%
  mutate(across(all_of(numerical_vars), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

# for categorical variables: Impute using group mode (based on male and hispanic)
categorical_vars <- c("townname", "race", "asthma", "active_asthma", "father_asthma", "mother_asthma", "wheeze", "hayfever", "allergy", "educ_parent", "smoke", "pets", "gasstove")

# impute missing categorical values
merged_data <- merged_data %>%
  group_by(male, hispanic) %>%
  mutate(across(all_of(categorical_vars), ~ ifelse(is.na(.), get_mode(.), .)))

# ungroup after imputation
merged_data <- ungroup(merged_data)

# check the updated data to make sure there are no missing values
summary(merged_data)
   townname              sid              male            race          
 Length:1200        Min.   :   1.0   Min.   :0.0000   Length:1200       
 Class :character   1st Qu.: 528.8   1st Qu.:0.0000   Class :character  
 Mode  :character   Median :1041.5   Median :0.0000   Mode  :character  
                    Mean   :1037.5   Mean   :0.4917                     
                    3rd Qu.:1554.2   3rd Qu.:1.0000                     
                    Max.   :2053.0   Max.   :1.0000                     
    hispanic          agepft           height        weight      
 Min.   :0.0000   Min.   : 8.961   Min.   :114   Min.   : 42.00  
 1st Qu.:0.0000   1st Qu.: 9.632   1st Qu.:135   1st Qu.: 66.00  
 Median :0.0000   Median : 9.907   Median :139   Median : 76.00  
 Mean   :0.4342   Mean   : 9.923   Mean   :139   Mean   : 79.31  
 3rd Qu.:1.0000   3rd Qu.:10.155   3rd Qu.:143   3rd Qu.: 87.00  
 Max.   :1.0000   Max.   :12.731   Max.   :165   Max.   :207.00  
      bmi            asthma       active_asthma  father_asthma    
 Min.   :11.30   Min.   :0.0000   Min.   :0.00   Min.   :0.00000  
 1st Qu.:15.96   1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:0.00000  
 Median :17.81   Median :0.0000   Median :0.00   Median :0.00000  
 Mean   :18.50   Mean   :0.1425   Mean   :0.19   Mean   :0.07583  
 3rd Qu.:19.99   3rd Qu.:0.0000   3rd Qu.:0.00   3rd Qu.:0.00000  
 Max.   :41.27   Max.   :1.0000   Max.   :1.00   Max.   :1.00000  
 mother_asthma        wheeze          hayfever         allergy      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.0975   Mean   :0.3117   Mean   :0.1575   Mean   :0.2775  
 3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
  educ_parent        smoke             pets           gasstove     
 Min.   :1.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:1.0000  
 Median :3.000   Median :0.0000   Median :1.0000   Median :1.0000  
 Mean   :2.808   Mean   :0.1583   Mean   :0.7667   Mean   :0.7875  
 3rd Qu.:3.000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :5.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
      fev              fvc            mmef          pm25_mass     
 Min.   : 984.8   Min.   : 895   Min.   : 757.6   Min.   : 5.960  
 1st Qu.:1827.6   1st Qu.:2065   1st Qu.:2043.7   1st Qu.: 7.615  
 Median :2016.4   Median :2282   Median :2392.1   Median :10.545  
 Mean   :2030.1   Mean   :2322   Mean   :2398.6   Mean   :14.362  
 3rd Qu.:2223.6   3rd Qu.:2551   3rd Qu.:2737.1   3rd Qu.:20.988  
 Max.   :3323.7   Max.   :3698   Max.   :4935.9   Max.   :29.970  
    pm25_so4        pm25_no3         pm25_nh4         pm25_oc      
 Min.   :0.790   Min.   : 0.730   Min.   :0.4100   Min.   : 1.450  
 1st Qu.:1.077   1st Qu.: 1.538   1st Qu.:0.7375   1st Qu.: 2.520  
 Median :1.815   Median : 2.525   Median :1.1350   Median : 4.035  
 Mean   :1.876   Mean   : 4.488   Mean   :1.7642   Mean   : 4.551  
 3rd Qu.:2.605   3rd Qu.: 7.338   3rd Qu.:2.7725   3rd Qu.: 5.350  
 Max.   :3.230   Max.   :12.200   Max.   :4.2500   Max.   :11.830  
    pm25_ec          pm25_om          pm10_oc          pm10_ec      
 Min.   :0.1300   Min.   : 1.740   Min.   : 1.860   Min.   :0.1400  
 1st Qu.:0.4000   1st Qu.: 3.020   1st Qu.: 3.228   1st Qu.:0.4100  
 Median :0.5850   Median : 4.840   Median : 5.170   Median :0.5950  
 Mean   :0.7358   Mean   : 5.460   Mean   : 5.832   Mean   :0.7525  
 3rd Qu.:1.1750   3rd Qu.: 6.418   3rd Qu.: 6.855   3rd Qu.:1.1975  
 Max.   :1.3600   Max.   :14.200   Max.   :15.160   Max.   :1.3900  
    pm10_tc           formic          acetic           hcl        
 Min.   : 1.990   Min.   :0.340   Min.   :0.750   Min.   :0.2200  
 1st Qu.: 3.705   1st Qu.:0.720   1st Qu.:2.297   1st Qu.:0.3250  
 Median : 6.505   Median :1.105   Median :2.910   Median :0.4350  
 Mean   : 6.784   Mean   :1.332   Mean   :3.010   Mean   :0.4208  
 3rd Qu.: 8.430   3rd Qu.:1.765   3rd Qu.:4.000   3rd Qu.:0.4625  
 Max.   :16.440   Max.   :2.770   Max.   :5.140   Max.   :0.7300  
      hno3           o3_max          o3106           o3_24      
 Min.   :0.430   Min.   :38.27   Min.   :28.22   Min.   :18.22  
 1st Qu.:1.593   1st Qu.:49.93   1st Qu.:41.90   1st Qu.:23.31  
 Median :2.455   Median :64.05   Median :46.74   Median :27.59  
 Mean   :2.367   Mean   :60.16   Mean   :47.76   Mean   :30.23  
 3rd Qu.:3.355   3rd Qu.:67.69   3rd Qu.:55.24   3rd Qu.:32.39  
 Max.   :4.070   Max.   :84.44   Max.   :67.01   Max.   :57.76  
      no2             pm10          no_24hr          pm2_5_fr    
 Min.   : 4.60   Min.   :18.40   Min.   : 2.050   Min.   : 9.01  
 1st Qu.:12.12   1st Qu.:20.71   1st Qu.: 6.487   1st Qu.:13.47  
 Median :16.40   Median :29.64   Median :14.080   Median :19.94  
 Mean   :18.99   Mean   :32.64   Mean   :16.218   Mean   :19.80  
 3rd Qu.:23.24   3rd Qu.:39.16   3rd Qu.:20.585   3rd Qu.:25.93  
 Max.   :37.97   Max.   :70.39   Max.   :42.950   Max.   :31.55  
     iacid           oacid        total_acids          lon        
 Min.   :0.760   Min.   :1.090   Min.   : 1.520   Min.   :-120.7  
 1st Qu.:1.835   1st Qu.:2.978   1st Qu.: 4.930   1st Qu.:-118.8  
 Median :2.825   Median :4.135   Median : 6.370   Median :-117.7  
 Mean   :2.788   Mean   :4.342   Mean   : 6.708   Mean   :-118.3  
 3rd Qu.:3.817   3rd Qu.:5.982   3rd Qu.: 9.395   3rd Qu.:-117.4  
 Max.   :4.620   Max.   :7.400   Max.   :11.430   Max.   :-116.8  
      lat       
 Min.   :32.84  
 1st Qu.:33.93  
 Median :34.10  
 Mean   :34.20  
 3rd Qu.:34.65  
 Max.   :35.49  
  1. Create a new categorical variable named “obesity_level” using the BMI measurement (underweight BMI<14; normal BMI 14-22; overweight BMI 22-24; obese BMI>24).To make sure the variable is rightly coded, create a summary table that contains the minimum BMI, maximum BMI, and the total number of observations per category.
# creating the obesity_level variable based on BMI
merged_data <- merged_data %>%
  mutate(obesity_level = case_when(
    bmi < 14 ~ "underweight",
    bmi >= 14 & bmi <= 22 ~ "normal",
    bmi > 22 & bmi <= 24 ~ "overweight",
    bmi > 24 ~ "obese"
  ))

# checking if the obesity_level variable was created correctly
# summary table with min, max, and count per category
obesity_summary_table <- merged_data %>%
  group_by(obesity_level) %>%
  summarise(
    min_bmi = min(bmi, na.rm = TRUE),
    max_bmi = max(bmi, na.rm = TRUE),
    count = n()
  )

# display the summary table
print(obesity_summary_table)
# A tibble: 4 × 4
  obesity_level min_bmi max_bmi count
  <chr>           <dbl>   <dbl> <int>
1 normal           14.0    22.0   975
2 obese            24.0    41.3   103
3 overweight       22.0    24.0    87
4 underweight      11.3    14.0    35

There are 975 individuals in the normal category, 103 in the obese category, 87 in the overweight category, and 35 in the underweight category. Summed up, this is 1200, meaning that all observations were accounted for. The minimum and maximum of each obesity level also follow the cut-off values accordingly.

  1. Create another categorical variable named “smoke_gas_exposure” that summarizes “Second Hand Smoke” and “Gas Stove.” The variable should have four categories in total.
# creating the smoke_gas_exposure variable
merged_data <- merged_data %>%
  mutate(smoke_gas_exposure = case_when(
    smoke == 0 & gasstove == 0 ~ "No exposure",
    smoke == 1 & gasstove == 0 ~ "Smoke exposure only",
    smoke == 0 & gasstove == 1 ~ "Gas stove exposure only",
    smoke == 1 & gasstove == 1 ~ "Both exposures"
  ))

# summary table checking the counts in each category of smoke_gas_exposure
smoke_summary_table <- merged_data %>%
  group_by(smoke_gas_exposure) %>%
  summarise(
    count = n())

# display the summary table
print(smoke_summary_table)
# A tibble: 4 × 2
  smoke_gas_exposure      count
  <chr>                   <int>
1 Both exposures            154
2 Gas stove exposure only   791
3 No exposure               219
4 Smoke exposure only        36

The dataset reveals that the majority of individuals (791) were exposed to gas stove emissions only, while a smaller group of 154 individuals experienced both smoke and gas stove exposure. Additionally, 219 individuals had no exposure to either, and the smallest group, comprising just 36 individuals, was exposed solely to smoke. These totals sum to 1,200, confirming that all observations are accounted for.

  1. Create four summary tables showing the average (or proportion, if binary) and sd of “Forced expiratory volume in 1 second (ml)” (an asthma indicator) by town, sex, obesity level, and “smoke_gas_exposure.”
# function to generate summary table
generate_summary <- function(group_var) {
  merged_data %>%
    group_by(!!sym(group_var)) %>%
    summarise(
      mean_fev = mean(fev, na.rm = TRUE),
      sd_fev = sd(fev, na.rm = TRUE),
      n = n()
    )
}

# variables to group by
group_vars <- c("townname", "male", "obesity_level", "smoke_gas_exposure")

# loop through each grouping variable and generate summary table
summary_tables <- lapply(group_vars, function(var) {
  cat("\nSummary by", var, ":\n")
  print(generate_summary(var))
})

Summary by townname :
# A tibble: 12 × 4
   townname      mean_fev sd_fev     n
   <chr>            <dbl>  <dbl> <int>
 1 Alpine           2087.   291.   100
 2 Atascadero       2076.   324.   100
 3 Lake Elsinore    2039.   304.   100
 4 Lake Gregory     2085.   320.   100
 5 Lancaster        2003.   317.   100
 6 Lompoc           2034.   351.   100
 7 Long Beach       1986.   319.   100
 8 Mira Loma        1985.   325.   100
 9 Riverside        1990.   278.   100
10 San Dimas        2027.   319.   100
11 Santa Maria      2026.   312.   100
12 Upland           2024.   343.   100

Summary by male :
# A tibble: 2 × 4
   male mean_fev sd_fev     n
  <int>    <dbl>  <dbl> <int>
1     0    1959.   312.   610
2     1    2104.   308.   590

Summary by obesity_level :
# A tibble: 4 × 4
  obesity_level mean_fev sd_fev     n
  <chr>            <dbl>  <dbl> <int>
1 normal           2000.   295.   975
2 obese            2266.   325.   103
3 overweight       2224.   317.    87
4 underweight      1698.   303.    35

Summary by smoke_gas_exposure :
# A tibble: 4 × 4
  smoke_gas_exposure      mean_fev sd_fev     n
  <chr>                      <dbl>  <dbl> <int>
1 Both exposures             2025.   301.   154
2 Gas stove exposure only    2023.   319.   791
3 No exposure                2057.   329.   219
4 Smoke exposure only        2056.   296.    36

Looking at the Data (EDA)

The primary questions of interest are: What is the association between BMI and FEV (forced expiratory volume)? What is the association between smoke and gas exposure and FEV? What is the association between PM2.5 exposure and FEV?

Follow the EDA checklist from week 3 and the previous assignment. Be sure to focus on the key variables.

# check the dimensions of the data
dim(merged_data)  
[1] 1200   51
# check the structure of the data
str(merged_data)
tibble [1,200 × 51] (S3: tbl_df/tbl/data.frame)
 $ townname          : chr [1:1200] "Alpine" "Alpine" "Alpine" "Alpine" ...
 $ sid               : int [1:1200] 841 835 838 840 865 867 842 839 844 847 ...
 $ male              : int [1:1200] 1 0 0 0 0 0 1 0 1 1 ...
 $ race              : chr [1:1200] "W" "W" "O" "W" ...
 $ hispanic          : int [1:1200] 1 0 1 0 0 1 1 1 1 0 ...
 $ agepft            : num [1:1200] 10.55 10.1 9.49 9.97 10.04 ...
 $ height            : num [1:1200] 150 143 133 146 162 141 139 142 143 137 ...
 $ weight            : num [1:1200] 78 69 62 78 140 94 65 86 65 69 ...
 $ bmi               : num [1:1200] 15.8 15.3 15.9 16.6 24.2 ...
 $ asthma            : int [1:1200] 0 0 0 0 1 0 0 0 0 0 ...
 $ active_asthma     : int [1:1200] 0 0 0 0 1 0 0 0 0 0 ...
 $ father_asthma     : int [1:1200] 0 0 0 0 0 0 0 0 0 0 ...
 $ mother_asthma     : int [1:1200] 0 0 0 0 0 0 0 1 0 0 ...
 $ wheeze            : int [1:1200] 0 0 0 0 1 0 1 1 0 0 ...
 $ hayfever          : int [1:1200] 0 0 0 0 0 0 0 1 0 0 ...
 $ allergy           : int [1:1200] 0 1 0 0 1 0 0 1 0 0 ...
 $ educ_parent       : int [1:1200] 5 3 4 3 3 5 1 3 3 5 ...
 $ smoke             : int [1:1200] 0 0 0 0 0 0 1 1 0 0 ...
 $ pets              : int [1:1200] 1 1 1 0 1 1 1 1 0 1 ...
 $ gasstove          : int [1:1200] 0 0 0 1 1 1 0 0 1 1 ...
 $ fev               : num [1:1200] 2252 2529 1738 2467 2584 ...
 $ fvc               : num [1:1200] 2595 2826 1964 2638 3568 ...
 $ mmef              : num [1:1200] 2445 3407 2133 3466 2071 ...
 $ pm25_mass         : num [1:1200] 8.74 8.74 8.74 8.74 8.74 8.74 8.74 8.74 8.74 8.74 ...
 $ pm25_so4          : num [1:1200] 1.73 1.73 1.73 1.73 1.73 1.73 1.73 1.73 1.73 1.73 ...
 $ pm25_no3          : num [1:1200] 1.59 1.59 1.59 1.59 1.59 1.59 1.59 1.59 1.59 1.59 ...
 $ pm25_nh4          : num [1:1200] 0.88 0.88 0.88 0.88 0.88 0.88 0.88 0.88 0.88 0.88 ...
 $ pm25_oc           : num [1:1200] 2.54 2.54 2.54 2.54 2.54 2.54 2.54 2.54 2.54 2.54 ...
 $ pm25_ec           : num [1:1200] 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 ...
 $ pm25_om           : num [1:1200] 3.04 3.04 3.04 3.04 3.04 3.04 3.04 3.04 3.04 3.04 ...
 $ pm10_oc           : num [1:1200] 3.25 3.25 3.25 3.25 3.25 3.25 3.25 3.25 3.25 3.25 ...
 $ pm10_ec           : num [1:1200] 0.49 0.49 0.49 0.49 0.49 0.49 0.49 0.49 0.49 0.49 ...
 $ pm10_tc           : num [1:1200] 3.75 3.75 3.75 3.75 3.75 3.75 3.75 3.75 3.75 3.75 ...
 $ formic            : num [1:1200] 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 ...
 $ acetic            : num [1:1200] 2.49 2.49 2.49 2.49 2.49 2.49 2.49 2.49 2.49 2.49 ...
 $ hcl               : num [1:1200] 0.41 0.41 0.41 0.41 0.41 0.41 0.41 0.41 0.41 0.41 ...
 $ hno3              : num [1:1200] 1.98 1.98 1.98 1.98 1.98 1.98 1.98 1.98 1.98 1.98 ...
 $ o3_max            : num [1:1200] 65.8 65.8 65.8 65.8 65.8 ...
 $ o3106             : num [1:1200] 55 55 55 55 55 ...
 $ o3_24             : num [1:1200] 41.2 41.2 41.2 41.2 41.2 ...
 $ no2               : num [1:1200] 12.2 12.2 12.2 12.2 12.2 ...
 $ pm10              : num [1:1200] 24.7 24.7 24.7 24.7 24.7 ...
 $ no_24hr           : num [1:1200] 2.48 2.48 2.48 2.48 2.48 2.48 2.48 2.48 2.48 2.48 ...
 $ pm2_5_fr          : num [1:1200] 10.3 10.3 10.3 10.3 10.3 ...
 $ iacid             : num [1:1200] 2.39 2.39 2.39 2.39 2.39 2.39 2.39 2.39 2.39 2.39 ...
 $ oacid             : num [1:1200] 3.52 3.52 3.52 3.52 3.52 3.52 3.52 3.52 3.52 3.52 ...
 $ total_acids       : num [1:1200] 5.5 5.5 5.5 5.5 5.5 5.5 5.5 5.5 5.5 5.5 ...
 $ lon               : num [1:1200] -117 -117 -117 -117 -117 ...
 $ lat               : num [1:1200] 32.8 32.8 32.8 32.8 32.8 ...
 $ obesity_level     : chr [1:1200] "normal" "normal" "normal" "normal" ...
 $ smoke_gas_exposure: chr [1:1200] "No exposure" "No exposure" "No exposure" "Gas stove exposure only" ...
# get summary statistics for numeric variables
summary(merged_data)
   townname              sid              male            race          
 Length:1200        Min.   :   1.0   Min.   :0.0000   Length:1200       
 Class :character   1st Qu.: 528.8   1st Qu.:0.0000   Class :character  
 Mode  :character   Median :1041.5   Median :0.0000   Mode  :character  
                    Mean   :1037.5   Mean   :0.4917                     
                    3rd Qu.:1554.2   3rd Qu.:1.0000                     
                    Max.   :2053.0   Max.   :1.0000                     
    hispanic          agepft           height        weight      
 Min.   :0.0000   Min.   : 8.961   Min.   :114   Min.   : 42.00  
 1st Qu.:0.0000   1st Qu.: 9.632   1st Qu.:135   1st Qu.: 66.00  
 Median :0.0000   Median : 9.907   Median :139   Median : 76.00  
 Mean   :0.4342   Mean   : 9.923   Mean   :139   Mean   : 79.31  
 3rd Qu.:1.0000   3rd Qu.:10.155   3rd Qu.:143   3rd Qu.: 87.00  
 Max.   :1.0000   Max.   :12.731   Max.   :165   Max.   :207.00  
      bmi            asthma       active_asthma  father_asthma    
 Min.   :11.30   Min.   :0.0000   Min.   :0.00   Min.   :0.00000  
 1st Qu.:15.96   1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:0.00000  
 Median :17.81   Median :0.0000   Median :0.00   Median :0.00000  
 Mean   :18.50   Mean   :0.1425   Mean   :0.19   Mean   :0.07583  
 3rd Qu.:19.99   3rd Qu.:0.0000   3rd Qu.:0.00   3rd Qu.:0.00000  
 Max.   :41.27   Max.   :1.0000   Max.   :1.00   Max.   :1.00000  
 mother_asthma        wheeze          hayfever         allergy      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.0975   Mean   :0.3117   Mean   :0.1575   Mean   :0.2775  
 3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
  educ_parent        smoke             pets           gasstove     
 Min.   :1.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:1.0000  
 Median :3.000   Median :0.0000   Median :1.0000   Median :1.0000  
 Mean   :2.808   Mean   :0.1583   Mean   :0.7667   Mean   :0.7875  
 3rd Qu.:3.000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :5.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
      fev              fvc            mmef          pm25_mass     
 Min.   : 984.8   Min.   : 895   Min.   : 757.6   Min.   : 5.960  
 1st Qu.:1827.6   1st Qu.:2065   1st Qu.:2043.7   1st Qu.: 7.615  
 Median :2016.4   Median :2282   Median :2392.1   Median :10.545  
 Mean   :2030.1   Mean   :2322   Mean   :2398.6   Mean   :14.362  
 3rd Qu.:2223.6   3rd Qu.:2551   3rd Qu.:2737.1   3rd Qu.:20.988  
 Max.   :3323.7   Max.   :3698   Max.   :4935.9   Max.   :29.970  
    pm25_so4        pm25_no3         pm25_nh4         pm25_oc      
 Min.   :0.790   Min.   : 0.730   Min.   :0.4100   Min.   : 1.450  
 1st Qu.:1.077   1st Qu.: 1.538   1st Qu.:0.7375   1st Qu.: 2.520  
 Median :1.815   Median : 2.525   Median :1.1350   Median : 4.035  
 Mean   :1.876   Mean   : 4.488   Mean   :1.7642   Mean   : 4.551  
 3rd Qu.:2.605   3rd Qu.: 7.338   3rd Qu.:2.7725   3rd Qu.: 5.350  
 Max.   :3.230   Max.   :12.200   Max.   :4.2500   Max.   :11.830  
    pm25_ec          pm25_om          pm10_oc          pm10_ec      
 Min.   :0.1300   Min.   : 1.740   Min.   : 1.860   Min.   :0.1400  
 1st Qu.:0.4000   1st Qu.: 3.020   1st Qu.: 3.228   1st Qu.:0.4100  
 Median :0.5850   Median : 4.840   Median : 5.170   Median :0.5950  
 Mean   :0.7358   Mean   : 5.460   Mean   : 5.832   Mean   :0.7525  
 3rd Qu.:1.1750   3rd Qu.: 6.418   3rd Qu.: 6.855   3rd Qu.:1.1975  
 Max.   :1.3600   Max.   :14.200   Max.   :15.160   Max.   :1.3900  
    pm10_tc           formic          acetic           hcl        
 Min.   : 1.990   Min.   :0.340   Min.   :0.750   Min.   :0.2200  
 1st Qu.: 3.705   1st Qu.:0.720   1st Qu.:2.297   1st Qu.:0.3250  
 Median : 6.505   Median :1.105   Median :2.910   Median :0.4350  
 Mean   : 6.784   Mean   :1.332   Mean   :3.010   Mean   :0.4208  
 3rd Qu.: 8.430   3rd Qu.:1.765   3rd Qu.:4.000   3rd Qu.:0.4625  
 Max.   :16.440   Max.   :2.770   Max.   :5.140   Max.   :0.7300  
      hno3           o3_max          o3106           o3_24      
 Min.   :0.430   Min.   :38.27   Min.   :28.22   Min.   :18.22  
 1st Qu.:1.593   1st Qu.:49.93   1st Qu.:41.90   1st Qu.:23.31  
 Median :2.455   Median :64.05   Median :46.74   Median :27.59  
 Mean   :2.367   Mean   :60.16   Mean   :47.76   Mean   :30.23  
 3rd Qu.:3.355   3rd Qu.:67.69   3rd Qu.:55.24   3rd Qu.:32.39  
 Max.   :4.070   Max.   :84.44   Max.   :67.01   Max.   :57.76  
      no2             pm10          no_24hr          pm2_5_fr    
 Min.   : 4.60   Min.   :18.40   Min.   : 2.050   Min.   : 9.01  
 1st Qu.:12.12   1st Qu.:20.71   1st Qu.: 6.487   1st Qu.:13.47  
 Median :16.40   Median :29.64   Median :14.080   Median :19.94  
 Mean   :18.99   Mean   :32.64   Mean   :16.218   Mean   :19.80  
 3rd Qu.:23.24   3rd Qu.:39.16   3rd Qu.:20.585   3rd Qu.:25.93  
 Max.   :37.97   Max.   :70.39   Max.   :42.950   Max.   :31.55  
     iacid           oacid        total_acids          lon        
 Min.   :0.760   Min.   :1.090   Min.   : 1.520   Min.   :-120.7  
 1st Qu.:1.835   1st Qu.:2.978   1st Qu.: 4.930   1st Qu.:-118.8  
 Median :2.825   Median :4.135   Median : 6.370   Median :-117.7  
 Mean   :2.788   Mean   :4.342   Mean   : 6.708   Mean   :-118.3  
 3rd Qu.:3.817   3rd Qu.:5.982   3rd Qu.: 9.395   3rd Qu.:-117.4  
 Max.   :4.620   Max.   :7.400   Max.   :11.430   Max.   :-116.8  
      lat        obesity_level      smoke_gas_exposure
 Min.   :32.84   Length:1200        Length:1200       
 1st Qu.:33.93   Class :character   Class :character  
 Median :34.10   Mode  :character   Mode  :character  
 Mean   :34.20                                        
 3rd Qu.:34.65                                        
 Max.   :35.49                                        
# view the first few rows
head(merged_data, 5)
# A tibble: 5 × 51
  townname   sid  male race  hispanic agepft height weight   bmi asthma
  <chr>    <int> <int> <chr>    <int>  <dbl>  <dbl>  <dbl> <dbl>  <int>
1 Alpine     841     1 W            1  10.5     150     78  15.8      0
2 Alpine     835     0 W            0  10.1     143     69  15.3      0
3 Alpine     838     0 O            1   9.49    133     62  15.9      0
4 Alpine     840     0 W            0   9.97    146     78  16.6      0
5 Alpine     865     0 W            0  10.0     162    140  24.2      1
# ℹ 41 more variables: active_asthma <int>, father_asthma <int>,
#   mother_asthma <int>, wheeze <int>, hayfever <int>, allergy <int>,
#   educ_parent <int>, smoke <int>, pets <int>, gasstove <int>, fev <dbl>,
#   fvc <dbl>, mmef <dbl>, pm25_mass <dbl>, pm25_so4 <dbl>, pm25_no3 <dbl>,
#   pm25_nh4 <dbl>, pm25_oc <dbl>, pm25_ec <dbl>, pm25_om <dbl>, pm10_oc <dbl>,
#   pm10_ec <dbl>, pm10_tc <dbl>, formic <dbl>, acetic <dbl>, hcl <dbl>,
#   hno3 <dbl>, o3_max <dbl>, o3106 <dbl>, o3_24 <dbl>, no2 <dbl>, …
# view the last few rows
tail(merged_data, 5)
# A tibble: 5 × 51
  townname   sid  male race  hispanic agepft height weight   bmi asthma
  <chr>    <int> <int> <chr>    <int>  <dbl>  <dbl>  <dbl> <dbl>  <int>
1 Upland    1867     0 M            1   9.62   140    71    16.5      0
2 Upland    2033     0 M            0  10.1    130    67    18.0      0
3 Upland    2031     1 W            0   9.80   135    83    20.7      0
4 Upland    2032     1 W            0   9.55   137    59    14.3      0
5 Upland    2053     0 W            0   9.85   139.   77.4  18.1      0
# ℹ 41 more variables: active_asthma <int>, father_asthma <int>,
#   mother_asthma <int>, wheeze <int>, hayfever <int>, allergy <int>,
#   educ_parent <int>, smoke <int>, pets <int>, gasstove <int>, fev <dbl>,
#   fvc <dbl>, mmef <dbl>, pm25_mass <dbl>, pm25_so4 <dbl>, pm25_no3 <dbl>,
#   pm25_nh4 <dbl>, pm25_oc <dbl>, pm25_ec <dbl>, pm25_om <dbl>, pm10_oc <dbl>,
#   pm10_ec <dbl>, pm10_tc <dbl>, formic <dbl>, acetic <dbl>, hcl <dbl>,
#   hno3 <dbl>, o3_max <dbl>, o3106 <dbl>, o3_24 <dbl>, no2 <dbl>, …
# check for missing values
colSums(is.na(merged_data))
          townname                sid               male               race 
                 0                  0                  0                  0 
          hispanic             agepft             height             weight 
                 0                  0                  0                  0 
               bmi             asthma      active_asthma      father_asthma 
                 0                  0                  0                  0 
     mother_asthma             wheeze           hayfever            allergy 
                 0                  0                  0                  0 
       educ_parent              smoke               pets           gasstove 
                 0                  0                  0                  0 
               fev                fvc               mmef          pm25_mass 
                 0                  0                  0                  0 
          pm25_so4           pm25_no3           pm25_nh4            pm25_oc 
                 0                  0                  0                  0 
           pm25_ec            pm25_om            pm10_oc            pm10_ec 
                 0                  0                  0                  0 
           pm10_tc             formic             acetic                hcl 
                 0                  0                  0                  0 
              hno3             o3_max              o3106              o3_24 
                 0                  0                  0                  0 
               no2               pm10            no_24hr           pm2_5_fr 
                 0                  0                  0                  0 
             iacid              oacid        total_acids                lon 
                 0                  0                  0                  0 
               lat      obesity_level smoke_gas_exposure 
                 0                  0                  0 

Checking Variables of Interest

summary(merged_data$obesity_level)
   Length     Class      Mode 
     1200 character character 
summary(merged_data$smoke_gas_exposure)
   Length     Class      Mode 
     1200 character character 
summary(merged_data$pm25_mass)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  5.960   7.615  10.545  14.362  20.988  29.970 

From the summary tables, we can see that the imputed variables of obesity_level and smoke_gas_exposure have no NA values, signifying that the data was imputed correctly. Likewise, we can also see that the third variable of interest, pm25_mass, has no missing values.

Association Between BMI and FEV

# scatter plot of BMI vs. FEV
ggplot(merged_data, aes(x = bmi, y = fev)) +
  geom_point(alpha = 0.7, color = "pink", size = 3) +
  geom_smooth(method = "lm", color = "turquoise", fill = "lightblue", se = TRUE) +
  labs(title = "Scatter Plot of BMI vs. Forced Expiratory Volume (ml)",
       x = "BMI",
       y = "Forced Expiratory Volume (ml)"
  )
`geom_smooth()` using formula = 'y ~ x'

# calculate correlation between BMI and FEV
cor_bmi_fev <- cor(merged_data$bmi, merged_data$fev, use = "complete.obs")
print(paste("Correlation between BMI and FEV:", cor_bmi_fev))
[1] "Correlation between BMI and FEV: 0.357112030981777"
# linear regression model
model_bmi_fev <- lm(fev ~ bmi, data = merged_data)
summary(model_bmi_fev)

Call:
lm(formula = fev ~ bmi, data = merged_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-989.47 -193.88  -11.32  179.43 1305.46 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1452.46      44.49   32.65   <2e-16 ***
bmi            31.23       2.36   13.23   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 297.2 on 1198 degrees of freedom
Multiple R-squared:  0.1275,    Adjusted R-squared:  0.1268 
F-statistic: 175.1 on 1 and 1198 DF,  p-value: < 2.2e-16

The correlation coefficient of 0.357 between BMI and FEV suggests a moderate positive linear relationship. This implies that higher BMI values are associated with increased FEV, but the strength of the association is not particularly robust. A correlation of this magnitude indicates that BMI accounts for some variation in FEV, but it is far from being the dominant predictor. The scatter plot shows this trend visually, with a slight upward slope in the fitted line, confirming the positive correlation. The R-squared value of 0.1275 means that about 12.75% of the variance in FEV is explained by BMI, indicating that while BMI is a significant predictor, it only accounts for a small portion of the variation in FEV. This result aligns with the correlation coefficient of 0.357, suggesting a moderate positive linear relationship as other factors not captured in this analysis may also play a significant role in determining FEV.

However, it’s essential to note that given this analysis is exploratory data analysis (EDA), we are primarily concerned with understanding associations and correlations rather than establishing statistical significance.

Association Between Smoke and Gas Exposure and FEV

# calculate mean FEV for each smoke/gas exposure group
mean_fev_data <- merged_data %>%
  group_by(smoke_gas_exposure) %>%
  summarize(mean_fev = mean(fev, na.rm = TRUE))

# boxplot of Smoke and Gas Exposure vs FEV
ggplot(merged_data, aes(x = smoke_gas_exposure, y = fev, fill = smoke_gas_exposure)) +
  geom_boxplot(alpha = 0.5, outlier.color = "black") +
  labs(title = "Boxplot of Smoke and Gas Exposure vs. Forced Expiratory Volume (ml)",
       x = "Smoke and Gas Exposure",
       y = "Forced Expiratory Volume (ml)") +
  guides(fill = guide_legend(title = "Smoke/Gas Exposure")) +
  theme_minimal(base_size = 10) +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "black", fill = "black") +
  geom_text(data = mean_fev_data, aes(x = smoke_gas_exposure, y = mean_fev, label = round(mean_fev, 1)),
            vjust = -0.75, color = "black", size = 3)

# perform ANOVA to test for association between smoke and gas exposure and FEV
anova_result <- aov(fev ~ smoke_gas_exposure, data = merged_data)
summary(anova_result)
                     Df    Sum Sq Mean Sq F value Pr(>F)
smoke_gas_exposure    3    226515   75505   0.746  0.525
Residuals          1196 121017980  101186               
# linear regression model
model_smoke_fev <- lm(fev ~ smoke_gas_exposure, data = merged_data)
summary(model_smoke_fev)

Call:
lm(formula = fev ~ smoke_gas_exposure, data = merged_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1037.8  -205.3   -12.0   190.7  1301.0 

Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                               2024.778     25.633  78.991   <2e-16
smoke_gas_exposureGas stove exposure only   -2.107     28.017  -0.075    0.940
smoke_gas_exposureNo exposure               31.915     33.453   0.954    0.340
smoke_gas_exposureSmoke exposure only       30.936     58.888   0.525    0.599
                                             
(Intercept)                               ***
smoke_gas_exposureGas stove exposure only    
smoke_gas_exposureNo exposure                
smoke_gas_exposureSmoke exposure only        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 318.1 on 1196 degrees of freedom
Multiple R-squared:  0.001868,  Adjusted R-squared:  -0.0006354 
F-statistic: 0.7462 on 3 and 1196 DF,  p-value: 0.5246

The box plot visually compares FEV across four smoke and gas exposure categories: both exposures, gas stove only, no exposure, and smoke exposure only. The median FEV values are relatively similar across all groups, ranging from about 2022 ml to 2056 ml. The interquartile ranges (IQRs) and the overall distribution of FEV also appear similar, indicating minimal variation between exposure groups. The small differences in medians suggest that smoke and gas exposure has little effect on FEV. The R-squared value of 0.0019 indicates that exposure explains less than 0.2% of the variation in FEV, suggesting other factors are more influential.

However, it’s essential to note that given we are conducting an EDA, we are to prioritize understanding associations and correlations rather than establishing statistical significance at this stage of the analysis.

Association Between PM2.5 Exposure and FEV

# calculate the correlation between PM2.5 exposure and FEV
cor_pm25_fev <- cor(merged_data$pm25_mass, merged_data$fev, use = "complete.obs")
print(paste("Correlation between PM2.5 and FEV:", cor_pm25_fev))
[1] "Correlation between PM2.5 and FEV: -0.0734131664601442"
# linear regression model
model_pm25_fev <- lm(fev ~ pm25_mass, data = merged_data)
summary(model_pm25_fev)

Call:
lm(formula = fev ~ pm25_mass, data = merged_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1070.63  -206.57   -13.33   195.87  1277.65 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2073.452     19.309 107.383   <2e-16 ***
pm25_mass     -3.016      1.184  -2.548    0.011 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 317.3 on 1198 degrees of freedom
Multiple R-squared:  0.005389,  Adjusted R-squared:  0.004559 
F-statistic: 6.492 on 1 and 1198 DF,  p-value: 0.01096
# scatter plot to visualize the relationship
ggplot(merged_data, aes(x = pm25_mass, y = fev)) +
  geom_point(alpha = 0.5, color = "pink") +
  geom_smooth(method = "lm", color = "coral", se = FALSE) +
  labs(title = "Scatter Plot of PM2.5 Exposure vs Forced Expiratory Volume (ml)",
       x = "PM2.5 Exposure (µg/m³)",
       y = "Forced Expiratory Volume (ml)") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

The scatter plot illustrates the relationship between PM2.5 exposure and FEV. The fitted line, which appears nearly horizontal, indicates a weak relationship between PM2.5 exposure and FEV. The correlation coefficient of approximately -0.073 suggests a very weak negative linear association, implying that changes in PM2.5 exposure are not meaningfully associated with changes in FEV.

As was the case for the previous plots, we are conducting an EDA so we choose to prioritize understanding associations and correlations rather than establishing statistical significance at this stage of the analysis.

Visualization

  1. Facet plot showing scatterplots with regression lines of BMI vs FEV by “townname”.
# calculate correlation coefficients by town
correlation_values <- merged_data %>%
  group_by(townname) %>%
  summarise(
    correlation = cor(fev, bmi, use = "complete.obs")
  )

# print correlation values
print(correlation_values)
# A tibble: 12 × 2
   townname      correlation
   <chr>               <dbl>
 1 Alpine              0.184
 2 Atascadero          0.383
 3 Lake Elsinore       0.545
 4 Lake Gregory        0.349
 5 Lancaster           0.383
 6 Lompoc              0.331
 7 Long Beach          0.410
 8 Mira Loma           0.371
 9 Riverside           0.257
10 San Dimas           0.444
11 Santa Maria         0.380
12 Upland              0.385
# plot BMI vs. FEV by town
merged_data %>%
ggplot(mapping = aes(x = bmi, y = fev, color = townname)) + 
  geom_point(alpha = 0.5)  + 
  geom_smooth(method = lm, color = "black") + 
  facet_wrap(~ townname) + 
  xlab("Body Mass Index") + 
  ylab("Forced Expiratory Volume (ml)") + 
  ggtitle("Scatterplots of BMI vs Forced Expiratory Volume (ml) by Town")
`geom_smooth()` using formula = 'y ~ x'

The facet plot visualizes the relationship between Body Mass Index (BMI) and Forced Expiratory Volume (FEV) across twelve towns, with each scatterplot showing a positive correlation between the two variables. The strength of the association varies by town, with Lake Elsinore (0.5446) showing the strongest positive correlation, indicating that higher BMI is more strongly associated with increased FEV in this town. Other towns like San Dimas (0.4443) and Long Beach (0.4104) also display moderate correlations. Weaker correlations are seen in Riverside (0.2571) and Alpine (0.1845), where BMI has a smaller effect on FEV. Overall, the data suggests a positive, though modest, relationship between BMI and lung function, with some variability across towns.

  1. Stacked histograms of FEV by BMI category and FEV by smoke/gas exposure. Use different color schemes than the ggplot default.

stacked histogram for FEV by BMI category

# stacked histogram for FEV by BMI category
ggplot(merged_data, aes(x = fev, fill = obesity_level)) +
  geom_histogram(position = "stack", bins = 30, alpha = 0.7) +
  scale_fill_manual(values = c("lightblue", "lightgreen", "lightcoral", "lightpink")) +
  labs(title = "Stacked Histogram of Forced Expiratory Volume (ml) by BMI Category",
       x = "Forced Expiratory Volume (ml)",
       y = "Count") +
  guides(fill = guide_legend(title = "Obesity Category"))

  theme_minimal()
List of 136
 $ line                            :List of 6
  ..$ colour       : chr "black"
  ..$ linewidth    : num 0.5
  ..$ linetype     : num 1
  ..$ lineend      : chr "butt"
  ..$ arrow        : logi FALSE
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_line" "element"
 $ rect                            :List of 5
  ..$ fill         : chr "white"
  ..$ colour       : chr "black"
  ..$ linewidth    : num 0.5
  ..$ linetype     : num 1
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_rect" "element"
 $ text                            :List of 11
  ..$ family       : chr ""
  ..$ face         : chr "plain"
  ..$ colour       : chr "black"
  ..$ size         : num 11
  ..$ hjust        : num 0.5
  ..$ vjust        : num 0.5
  ..$ angle        : num 0
  ..$ lineheight   : num 0.9
  ..$ margin       : 'margin' num [1:4] 0points 0points 0points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : logi FALSE
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ title                           : NULL
 $ aspect.ratio                    : NULL
 $ axis.title                      : NULL
 $ axis.title.x                    :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 1
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 2.75points 0points 0points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.title.x.top                :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 0
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 0points 2.75points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.title.x.bottom             : NULL
 $ axis.title.y                    :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 1
  ..$ angle        : num 90
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 2.75points 0points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.title.y.left               : NULL
 $ axis.title.y.right              :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 1
  ..$ angle        : num -90
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 0points 0points 2.75points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text                       :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : chr "grey30"
  ..$ size         : 'rel' num 0.8
  ..$ hjust        : NULL
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text.x                     :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 1
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 2.2points 0points 0points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text.x.top                 :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 0
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 0points 2.2points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text.x.bottom              : NULL
 $ axis.text.y                     :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : num 1
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 2.2points 0points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text.y.left                : NULL
 $ axis.text.y.right               :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : num 0
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 0points 0points 2.2points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text.theta                 : NULL
 $ axis.text.r                     :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : num 0.5
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 2.2points 0points 2.2points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.ticks                      : list()
  ..- attr(*, "class")= chr [1:2] "element_blank" "element"
 $ axis.ticks.x                    : NULL
 $ axis.ticks.x.top                : NULL
 $ axis.ticks.x.bottom             : NULL
 $ axis.ticks.y                    : NULL
 $ axis.ticks.y.left               : NULL
 $ axis.ticks.y.right              : NULL
 $ axis.ticks.theta                : NULL
 $ axis.ticks.r                    : NULL
 $ axis.minor.ticks.x.top          : NULL
 $ axis.minor.ticks.x.bottom       : NULL
 $ axis.minor.ticks.y.left         : NULL
 $ axis.minor.ticks.y.right        : NULL
 $ axis.minor.ticks.theta          : NULL
 $ axis.minor.ticks.r              : NULL
 $ axis.ticks.length               : 'simpleUnit' num 2.75points
  ..- attr(*, "unit")= int 8
 $ axis.ticks.length.x             : NULL
 $ axis.ticks.length.x.top         : NULL
 $ axis.ticks.length.x.bottom      : NULL
 $ axis.ticks.length.y             : NULL
 $ axis.ticks.length.y.left        : NULL
 $ axis.ticks.length.y.right       : NULL
 $ axis.ticks.length.theta         : NULL
 $ axis.ticks.length.r             : NULL
 $ axis.minor.ticks.length         : 'rel' num 0.75
 $ axis.minor.ticks.length.x       : NULL
 $ axis.minor.ticks.length.x.top   : NULL
 $ axis.minor.ticks.length.x.bottom: NULL
 $ axis.minor.ticks.length.y       : NULL
 $ axis.minor.ticks.length.y.left  : NULL
 $ axis.minor.ticks.length.y.right : NULL
 $ axis.minor.ticks.length.theta   : NULL
 $ axis.minor.ticks.length.r       : NULL
 $ axis.line                       : list()
  ..- attr(*, "class")= chr [1:2] "element_blank" "element"
 $ axis.line.x                     : NULL
 $ axis.line.x.top                 : NULL
 $ axis.line.x.bottom              : NULL
 $ axis.line.y                     : NULL
 $ axis.line.y.left                : NULL
 $ axis.line.y.right               : NULL
 $ axis.line.theta                 : NULL
 $ axis.line.r                     : NULL
 $ legend.background               : list()
  ..- attr(*, "class")= chr [1:2] "element_blank" "element"
 $ legend.margin                   : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
  ..- attr(*, "unit")= int 8
 $ legend.spacing                  : 'simpleUnit' num 11points
  ..- attr(*, "unit")= int 8
 $ legend.spacing.x                : NULL
 $ legend.spacing.y                : NULL
 $ legend.key                      : list()
  ..- attr(*, "class")= chr [1:2] "element_blank" "element"
 $ legend.key.size                 : 'simpleUnit' num 1.2lines
  ..- attr(*, "unit")= int 3
 $ legend.key.height               : NULL
 $ legend.key.width                : NULL
 $ legend.key.spacing              : 'simpleUnit' num 5.5points
  ..- attr(*, "unit")= int 8
 $ legend.key.spacing.x            : NULL
 $ legend.key.spacing.y            : NULL
 $ legend.frame                    : NULL
 $ legend.ticks                    : NULL
 $ legend.ticks.length             : 'rel' num 0.2
 $ legend.axis.line                : NULL
 $ legend.text                     :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : 'rel' num 0.8
  ..$ hjust        : NULL
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ legend.text.position            : NULL
 $ legend.title                    :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : num 0
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ legend.title.position           : NULL
 $ legend.position                 : chr "right"
 $ legend.position.inside          : NULL
 $ legend.direction                : NULL
 $ legend.byrow                    : NULL
 $ legend.justification            : chr "center"
 $ legend.justification.top        : NULL
 $ legend.justification.bottom     : NULL
 $ legend.justification.left       : NULL
 $ legend.justification.right      : NULL
 $ legend.justification.inside     : NULL
 $ legend.location                 : NULL
 $ legend.box                      : NULL
 $ legend.box.just                 : NULL
 $ legend.box.margin               : 'margin' num [1:4] 0cm 0cm 0cm 0cm
  ..- attr(*, "unit")= int 1
 $ legend.box.background           : list()
  ..- attr(*, "class")= chr [1:2] "element_blank" "element"
 $ legend.box.spacing              : 'simpleUnit' num 11points
  ..- attr(*, "unit")= int 8
  [list output truncated]
 - attr(*, "class")= chr [1:2] "theme" "gg"
 - attr(*, "complete")= logi TRUE
 - attr(*, "validate")= logi TRUE

In the stacked histogram of FEV by BMI category, we observe that the distribution of FEV shifts based on obesity levels. Only the “normal” obesity category shows a relatively normal distribution, the “obese” group slightly normal but centered at a different value in comparison. The “normal” BMI group is centered around an FEV of 1800-2100 ml, while the “underweight” group has most of its FEV values between 1500-2000 ml. However, the “underweight” category has the fewest observations, which could affect its distribution in this dataset as it doesn’t appear normal. Both the “obese” and “overweight” groups have their FEV distributions centered around 2000-2500 ml. Additionally, we notice a potential outlier in the “obese” group with an FEV above 3250 ml, which contrasts with the group’s typical FEV range. Overall, The stacked histogram demonstrates that while the centers of the distributions for each obesity group are distinct, there is still considerable overlap in the FEV ranges across groups. For example, individuals in the “normal,” “overweight,” and “obese” categories can all have FEVs around 2000 ml.

Stacked histogram for FEV by smoke/gas exposure

# stacked histogram for FEV by smoke/gas exposure
ggplot(merged_data, aes(x = fev, fill = smoke_gas_exposure)) +
  geom_histogram(position = "stack", bins = 30, alpha = 0.7) +
  scale_fill_manual(values = c("lightblue", "lightgreen", "lightcoral", "lightpink")) +
  labs(title = "Stacked Histogram of Forced Expiratory Volume (ml) by Smoke/Gas Exposure",
       x = "Forced Expiratory Volume (ml)",
       y = "Count") +
  guides(fill = guide_legend(title = "Smoke/Gas Exposure"))

  theme_minimal()
List of 136
 $ line                            :List of 6
  ..$ colour       : chr "black"
  ..$ linewidth    : num 0.5
  ..$ linetype     : num 1
  ..$ lineend      : chr "butt"
  ..$ arrow        : logi FALSE
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_line" "element"
 $ rect                            :List of 5
  ..$ fill         : chr "white"
  ..$ colour       : chr "black"
  ..$ linewidth    : num 0.5
  ..$ linetype     : num 1
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_rect" "element"
 $ text                            :List of 11
  ..$ family       : chr ""
  ..$ face         : chr "plain"
  ..$ colour       : chr "black"
  ..$ size         : num 11
  ..$ hjust        : num 0.5
  ..$ vjust        : num 0.5
  ..$ angle        : num 0
  ..$ lineheight   : num 0.9
  ..$ margin       : 'margin' num [1:4] 0points 0points 0points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : logi FALSE
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ title                           : NULL
 $ aspect.ratio                    : NULL
 $ axis.title                      : NULL
 $ axis.title.x                    :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 1
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 2.75points 0points 0points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.title.x.top                :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 0
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 0points 2.75points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.title.x.bottom             : NULL
 $ axis.title.y                    :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 1
  ..$ angle        : num 90
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 2.75points 0points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.title.y.left               : NULL
 $ axis.title.y.right              :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 1
  ..$ angle        : num -90
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 0points 0points 2.75points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text                       :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : chr "grey30"
  ..$ size         : 'rel' num 0.8
  ..$ hjust        : NULL
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text.x                     :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 1
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 2.2points 0points 0points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text.x.top                 :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 0
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 0points 2.2points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text.x.bottom              : NULL
 $ axis.text.y                     :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : num 1
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 2.2points 0points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text.y.left                : NULL
 $ axis.text.y.right               :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : num 0
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 0points 0points 2.2points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text.theta                 : NULL
 $ axis.text.r                     :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : num 0.5
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 2.2points 0points 2.2points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.ticks                      : list()
  ..- attr(*, "class")= chr [1:2] "element_blank" "element"
 $ axis.ticks.x                    : NULL
 $ axis.ticks.x.top                : NULL
 $ axis.ticks.x.bottom             : NULL
 $ axis.ticks.y                    : NULL
 $ axis.ticks.y.left               : NULL
 $ axis.ticks.y.right              : NULL
 $ axis.ticks.theta                : NULL
 $ axis.ticks.r                    : NULL
 $ axis.minor.ticks.x.top          : NULL
 $ axis.minor.ticks.x.bottom       : NULL
 $ axis.minor.ticks.y.left         : NULL
 $ axis.minor.ticks.y.right        : NULL
 $ axis.minor.ticks.theta          : NULL
 $ axis.minor.ticks.r              : NULL
 $ axis.ticks.length               : 'simpleUnit' num 2.75points
  ..- attr(*, "unit")= int 8
 $ axis.ticks.length.x             : NULL
 $ axis.ticks.length.x.top         : NULL
 $ axis.ticks.length.x.bottom      : NULL
 $ axis.ticks.length.y             : NULL
 $ axis.ticks.length.y.left        : NULL
 $ axis.ticks.length.y.right       : NULL
 $ axis.ticks.length.theta         : NULL
 $ axis.ticks.length.r             : NULL
 $ axis.minor.ticks.length         : 'rel' num 0.75
 $ axis.minor.ticks.length.x       : NULL
 $ axis.minor.ticks.length.x.top   : NULL
 $ axis.minor.ticks.length.x.bottom: NULL
 $ axis.minor.ticks.length.y       : NULL
 $ axis.minor.ticks.length.y.left  : NULL
 $ axis.minor.ticks.length.y.right : NULL
 $ axis.minor.ticks.length.theta   : NULL
 $ axis.minor.ticks.length.r       : NULL
 $ axis.line                       : list()
  ..- attr(*, "class")= chr [1:2] "element_blank" "element"
 $ axis.line.x                     : NULL
 $ axis.line.x.top                 : NULL
 $ axis.line.x.bottom              : NULL
 $ axis.line.y                     : NULL
 $ axis.line.y.left                : NULL
 $ axis.line.y.right               : NULL
 $ axis.line.theta                 : NULL
 $ axis.line.r                     : NULL
 $ legend.background               : list()
  ..- attr(*, "class")= chr [1:2] "element_blank" "element"
 $ legend.margin                   : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
  ..- attr(*, "unit")= int 8
 $ legend.spacing                  : 'simpleUnit' num 11points
  ..- attr(*, "unit")= int 8
 $ legend.spacing.x                : NULL
 $ legend.spacing.y                : NULL
 $ legend.key                      : list()
  ..- attr(*, "class")= chr [1:2] "element_blank" "element"
 $ legend.key.size                 : 'simpleUnit' num 1.2lines
  ..- attr(*, "unit")= int 3
 $ legend.key.height               : NULL
 $ legend.key.width                : NULL
 $ legend.key.spacing              : 'simpleUnit' num 5.5points
  ..- attr(*, "unit")= int 8
 $ legend.key.spacing.x            : NULL
 $ legend.key.spacing.y            : NULL
 $ legend.frame                    : NULL
 $ legend.ticks                    : NULL
 $ legend.ticks.length             : 'rel' num 0.2
 $ legend.axis.line                : NULL
 $ legend.text                     :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : 'rel' num 0.8
  ..$ hjust        : NULL
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ legend.text.position            : NULL
 $ legend.title                    :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : num 0
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ legend.title.position           : NULL
 $ legend.position                 : chr "right"
 $ legend.position.inside          : NULL
 $ legend.direction                : NULL
 $ legend.byrow                    : NULL
 $ legend.justification            : chr "center"
 $ legend.justification.top        : NULL
 $ legend.justification.bottom     : NULL
 $ legend.justification.left       : NULL
 $ legend.justification.right      : NULL
 $ legend.justification.inside     : NULL
 $ legend.location                 : NULL
 $ legend.box                      : NULL
 $ legend.box.just                 : NULL
 $ legend.box.margin               : 'margin' num [1:4] 0cm 0cm 0cm 0cm
  ..- attr(*, "unit")= int 1
 $ legend.box.background           : list()
  ..- attr(*, "class")= chr [1:2] "element_blank" "element"
 $ legend.box.spacing              : 'simpleUnit' num 11points
  ..- attr(*, "unit")= int 8
  [list output truncated]
 - attr(*, "class")= chr [1:2] "theme" "gg"
 - attr(*, "complete")= logi TRUE
 - attr(*, "validate")= logi TRUE

This histogram shows FEV distribution across four categories of smoke and gas exposure: both exposures, gas stove exposure only, no exposure, and smoke exposure only. The largest counts fall within the “gas stove exposure only” category, which dominates the central portion of the FEV range, from 1500 to 3000 ml. The “no exposure” category is more evenly spread, while “both exposures” and “smoke exposure only” have smaller counts overall. Although these distributions appear normal, there is little distinction in where they are centered based on smoke or gas exposure. This suggests there isn’t compelling evidence from this dataset that smoke or gas exposure significantly influences FEV, as the differences between the exposure groups are not as pronounced as those seen in the BMI categories.

  1. Barchart of BMI by smoke/gas exposure.
merged_data %>%
  ggplot(aes(x = obesity_level, fill = smoke_gas_exposure)) +
  geom_bar(position = "dodge", alpha = 0.5) +
  scale_fill_manual(values = c("lightblue", "lightgreen", "lightcoral", "lightpink")) + 
  labs(
    title = "Bar Chart of BMI by Smoke and Gas Exposure",
    x = "BMI Categories",
    y = "Count",
    fill = "Smoke and Gas Exposure"
  ) +
  theme_minimal()

The bar chart illustrates the distribution of BMI categories (normal, obese, overweight, underweight) based on exposure to smoke and gas from stoves. The data reveals that individuals with normal BMI are predominantly exposed to gas stove only, with a notable portion having no exposure. In contrast, the number of individuals exposed to both smoke and gas or smoke alone is significantly lower across all BMI categories. Among those classified as obese or overweight, the trend remains consistent, with gas stove exposure being the most common factor, followed by no exposure. In the underweight category, there is a smaller population, primarily exposed to gas stove only as is the case across te other BMI categories. From this plot, there is no evidence that obesity level varies by smoke or gas exposure, as the distribution of smoke and gas exposures seems consistent across obesity groups.

  1. Statistical summary graphs of FEV by BMI and FEV by smoke/gas exposure category. ## Statistical summary graph of FEV by BMI
# calculate mean and standard deviation of FEV by BMI
bmi_summary <- merged_data %>%
  group_by(obesity_level) %>%
  summarise(mean_fev = mean(fev, na.rm = TRUE)) %>%
  ungroup()
  
# scatter plot with summary
ggplot(data = merged_data, aes(x = obesity_level, y = fev, fill = obesity_level)) + 
  geom_boxplot(alpha = 0.5, outlier.color = "black") +  
  labs(title = "Boxplot of Forced Expiratory Volume (ml) by Obesity Level",
       x = "Obesity Level",
       y = "Forced Expiratory Volume (ml)") +  
  guides(fill = guide_legend(title = "Obesity Category")) +
  scale_fill_brewer(palette = "Pastel2") +  # Added missing `+`
  geom_text(data = bmi_summary, aes(x = obesity_level, y = mean_fev, label = round(mean_fev, 1)),
            vjust = -0.75, color = "black", size = 3) +  
  theme_minimal(base_size = 15)

This boxplot shows that individuals classified as “obese’ have the highest median FEV (2266.2 ml), followed by those who are “overweight” (2224.3 ml). People with normal BMI have a lower median FEV (1999.8 ml), while the “underweight” group has the lowest median value (1698.3 ml). The data also reveal that “normal”-weight and “obese”“ individuals exhibit greater variability in FEV, as evidenced by the presence of more outliers in these groups.

Statistical summary graph of FEV by smoke/gas exposure category

# calculate mean and standard deviation of FEV by smoke/gas exposure category
smokegas_summary <- merged_data %>%
  group_by(smoke_gas_exposure) %>%
  summarise(mean_fev = mean(fev, na.rm = TRUE))

# boxplot of FEV by  smoke/gas exposure category
ggplot(data = merged_data, aes(x = smoke_gas_exposure, y = fev, fill = smoke_gas_exposure)) + 
  geom_boxplot(alpha = 0.5, outlier.color = "black") +  
  labs(title = "Boxplot of Forced Expiratory Volume (ml) by Smoke and Gas Exposure Category",
       x = "Smoke and Gas Exposure",
       y = "Forced Expiratory Volume (ml)") +  
  guides(fill = guide_legend(title = "Smoke/Gas Exposure Category")) +  # Corrected legend title
  scale_fill_brewer(palette = "Set3") +
  geom_text(data = smokegas_summary, aes(x = smoke_gas_exposure, y = mean_fev, label = round(mean_fev, 1)),
            vjust = -0.75, color = "black", size = 3) +  
  theme_minimal(base_size = 10)

In contrast to the previous boxplot, there is minimal difference in the median FEV across the smoke and gas exposure categories. Those with no exposure (2056.7 ml) and smoke exposure only (2055.7 ml) have slightly higher FEV values compared to individuals with both exposures (2024.8 ml) or gas stove exposure only (2022.7 ml). The overall spread of FEV is similar across all exposure groups, though both the no exposure and both exposures categories show a higher number of outliers, indicating that other factors may contribute to variability in lung function.

In summary, BMI appears to have a more noticeable impact on FEV, with obese and overweight individuals displaying higher lung function. On the other hand, differences in smoke and gas exposure seem to have a smaller influence on FEV given overlapping data, though some variability exists across the exposure categories.

  1. A leaflet map showing the concentrations of PM2.5 mass in each of the CHS communities.
# create a color palette based on PM2.5 mass concentration
pal <- colorNumeric(palette = "viridis", domain = merged_data$pm25_mass)

# create the leaflet map
leaflet(merged_data) %>%
  addTiles() %>%  
  addCircleMarkers(
    ~lon, ~lat,  
    radius = 5,  
    color = ~pal(pm25_mass), 
    stroke = FALSE,  
    fillOpacity = 0.6,  
    popup = ~paste("PM2.5 Mass Concentration: ", round(pm25_mass, 2))  
  ) %>%
  addLegend("bottomright", pal = pal, values = ~pm25_mass,
            title = "PM2.5 Mass Concentration",
            opacity = 1)

Based on the map, it appears that site with the highest PM2.5 mass is located at one of the more eastern locations in California. We can retrieve this value among others.

unique(merged_data$townname[merged_data$pm25_mass == max(merged_data$pm25_mass)])
[1] "Mira Loma"
unique(merged_data$townname[merged_data$pm25_mass > mean(merged_data$pm25_mass)])
[1] "Long Beach" "Mira Loma"  "Riverside"  "San Dimas"  "Upland"    

Mira Loma exhibits the highest PM2.5 mass levels in this dataset. Based on online sources, it is recognized as one of the most polluted cities in Southern California in terms of PM2.5 pollution, largely due to its close proximity to the Ontario Freeway. This observation is corroborated by the data in this dataset.

Other cities with above-average PM2.5 levels include Long Beach, Riverside, San Dimas, and Upland. Like Mira Loma, these cities are located near major freeways, which contributes to elevated PM2.5 levels. In contrast, areas with lower PM2.5 concentrations are seen to be closer to the coast or situated further inland.

  1. Choose a visualization to examine whether PM2.5 mass is associated with FEV.
# calculate correlation coefficient
cor_pm25_fev <- cor(merged_data$pm25_mass, merged_data$fev, use = "complete.obs")
print(paste("Correlation between PM2.5 Mass and FEV:", cor_pm25_fev))
[1] "Correlation between PM2.5 Mass and FEV: -0.0734131664601442"
# scatterplot with smoothing line
ggplot(data = merged_data, mapping = aes(x = pm25_mass, y = fev)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "loess", col = "pink", se = FALSE) +  
  labs(title = "Scatterplot of PM2.5 Mass vs Forced Expiratory Volume (ml)", 
       x = "PM2.5 Mass (µg/m³)", 
       y = "Forced Expiratory Volume (ml)") + 
  xlim(5.96, 29.97) +
  annotate("text", x = 10, y = max(merged_data$fev, na.rm = TRUE), 
           label = paste("Correlation Coefficient:", round(cor_pm25_fev, 2)), color = "black", size = 4)
`geom_smooth()` using formula = 'y ~ x'

A scatter plot is the ideal choice to examine the relationship between PM2.5 mass and FEV because both are continuous variables. The scatter plot reveals a slight negative relationship, although weak, between PM2.5 mass and FEV, as indicated by the downward slope of the linear regression line with a value of -0.07. Notably, the mean FEV1 for the participants exposed to <= 10 µg/m³ was the highest and those exposed to 30 µg/m³ was the lowest. This suggests that as PM2.5 mass increases, FEV tends to decrease slightly, indicating a possible adverse effect of air pollution on lung function.